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Computational  Methods 
for  Sparse  Solution  of  Linear  Inverse  Problems 

Joel  A.  Tropp  and  Stephen  J.  Wright 


Abstract — In  sparse  approximation  problems,  the  goal  is  to  find 
an  approximate  representation  of  a  target  signal  using  a  linear 
combination  of  a  few  elementary  signals  drawn  from  a  fixed 
collection.  This  paper  surveys  the  major  algorithms  that  are  used 
for  solving  sparse  approximation  problems  in  practice.  Specific 
attention  is  paid  to  computational  issues,  to  the  circumstances 
in  which  individual  methods  tend  to  perform  well,  and  to  the 
theoretical  guarantees  available.  Many  fundamental  questions  in 
electrical  engineering,  statistics,  and  applied  mathematics  can 
be  posed  as  sparse  approximation  problems,  which  makes  the 
algorithms  discussed  in  this  paper  versatile  tools  with  a  wealth 
of  applications. 

Index  Terms — sparse  approximation,  compressed  sensing, 
matching  pursuit,  convex  optimization 

I.  Introduction 

LINEAR  inverse  problems  arise  throughout  engineering 
and  the  mathematical  sciences.  In  most  applications, 
these  problems  are  ill-conditioned  or  underdetermined,  so 
we  must  apply  additional  regularizing  constraints  in  order  to 
obtain  interesting  or  useful  solutions.  The  last  two  decades 
have  witnessed  an  explosion  of  interest  in  regularization  via 
sparsity  constraints.  That  is,  we  seek  approximate  solutions 
to  linear  systems  where  the  unknown  has  few  nonzero  entries 
relative  to  its  dimension: 

Find  sparse  x  such  that  <E.r;  ss  u. 

where  it  is  a  target  signal  and  $  is  a  known  matrix. 
Generically,  this  formulation  is  referred  to  as  sparse  approx¬ 
imation  [1],  These  problems  arise  in  many  areas,  including 
statistics,  signal  processing,  machine  learning,  coding  theory, 
and  approximation  theory.  Compressive  sampling  refers  to  a 
specific  type  of  sparse  approximation  problem  first  studied 
in  [2],  [3], 

Tykhonov  regularization,  the  classical  device  for  solving 
linear  inverse  problems,  controls  the  energy  of  the  unknown 
vector  (i.e.,  the  Euclidean  norm).  This  approach  leads  a  linear 
least-squares  problem  whose  solution  is  generally  nonsparse. 
To  obtain  sparse  solutions,  we  must  develop  more  sophis¬ 
ticated  algorithms  and — often — commit  more  computational 
resources.  The  effort  pays  off.  Recent  research  has  demon¬ 
strated  that,  in  many  cases  of  interest,  there  are  algorithms 
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that  can  correctly  solve  large  sparse  approximation  problems 
in  reasonable  time. 

In  this  paper,  we  give  an  overview  of  algorithms  for  sparse 
approximation,  describing  their  computational  requirements 
and  the  relationships  between  them.  We  also  discuss  the 
types  of  problems  where  each  method  is  most  effective  in 
practice.  Finally,  we  sketch  the  theoretical  results  that  justify 
the  application  of  these  algorithms. 

Subsection  I-A  describes  “ideal”  formulations  of  sparse 
approximation  and  some  common  features  of  algorithms  for 
approaching  these  problems.  Section  II  provides  additional 
detail  about  greedy  pursuit  methods.  Section  III  presents 
formulations  based  on  convex  optimization  and  algorithms  for 
solving  these  convex  programs.  Finally,  Section  IV  surveys 
some  new  horizons  worth  exploring. 

A.  Formulations 

Suppose  that  3>  £  Rmxiv  is  a  real  matrix  whose  columns 
have  unit  Euclidean  norm:  1 1 1 1 2  =  1  for  j  =  1,2,..., A. 
(The  normalization  does  not  compromise  generality.)  This 
matrix  is  often  referred  to  as  a  dictionary.  The  “entries”  in 
the  dictionary  are  the  columns  of  the  matrix,  and  a  column 
submatrix  is  called  a  subdictionary. 

The  counting  function  ||  •  ||o  :  M.N  — >  R.  returns  the  number 
of  nonzero  components  in  its  argument.  We  say  that  a  vector  x 
is  s-sparse  when  ||a:||o  <  s.  When  it  =  $x,  we  refer  to  x  as 
a  representation  of  the  signal  it  with  respect  to  the  dictionary. 

The  most  basic  problem  we  consider  is  to  produce  a 
maximally  sparse  representation  of  an  observed  signal  it: 

min||£c||o  subject  to  <&x  =  it.  (1) 

X 

One  natural  variation  is  to  relax  the  equality  constraint  to  allow 
some  error  tolerance  e  >  0,  in  case  the  observed  signal  is 
contaminated  with  noise: 

min||£c||0  subject  to  ||$tc  —  it  ||2  <  e.  (2) 

X 

It  is  most  common  to  measure  the  prediction-observation 
discrepancy  with  the  Euclidean  norm,  but  other  metrics  may 
also  be  appropriate. 

The  elements  of  (2)  can  be  combined  in  several  ways  to 
obtain  related  problems.  For  example,  we  can  seek  the  minimal 
error  possible  at  a  given  level  of  sparsity  s  >  1: 

min  ||  —  it|| 2  subject  to  \\x  Ho  <  S.  (3) 

X 

We  can  also  use  a  parameter  A  >  0  to  balance  the  twin 
objectives  of  minimizing  both  the  error  and  the  sparsity: 

mini||$a;-ii||i  +  A||a;||o. 

x  Z 
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If  there  are  no  restrictions  on  the  dictionary  $  and  the 
signal  m,  then  sparse  approximation  is  at  least  as  hard  as 
a  general  constraint  satisfaction  problem.  Indeed,  for  fixed 
constants  C,  K  >  1,  it  is  NP-hard  to  produce  a  (Cs)-sparse 
approximation  whose  error  lies  within  a  factor  I\  of  the 
minimal  s-term  approximation  error  [4,  Sec.  0.8.2], 

Nevertheless,  over  the  last  decade,  researchers  have  identi¬ 
fied  many  interesting  classes  of  sparse  approximation  prob¬ 
lems  that  submit  to  computationally  tractable  algorithms. 
These  striking  results  help  to  explain  why  sparse  approxima¬ 
tion  has  such  an  important  and  popular  topic  of  research  in 
recent  years. 

B.  Structured  Sparse  Models 

Sparse  approximation  has  become  increasingly  important 
as  it  has  become  clear  that  sparsity  constraints  are  pervasive. 
Here,  we  focus  on  how  sparse  models  affect  the  efficiency 
of  algorithms.  For  more  details  on  sparse  modeling,  see  other 
papers  in  this  volume. 

Researchers  in  mathematical  signal  processing  have  demon¬ 
strated  convincingly  that  many  naturally  occurring  signals  are 
sparse  with  respect  to  dictionaries  that  can  be  constructed  with 
methods  from  harmonic  analysis  [5],  For  example,  natural 
images  can  be  approximated  with  relatively  few  wavelet 
coefficients.  As  a  consequence,  in  many  sparse  approximation 
problems,  the  dictionary  $  has  tremendous  structure  and  offers 
fast  matrix-vector  multiplications. 

In  compressive  sampling,  we  typically  view  <1>  as  the 
product  of  a  random  observation  matrix  and  a  fixed  orthogonal 
matrix  that  determines  a  basis  in  which  the  signal  is  sparse. 
For  large-scale  compressive  sampling  problems,  it  is  essential 
that  the  observation  matrix  and  sparsity  basis  both  admit  an 
efficient  matrix-vector  multiply,  or  else  algorithms  will  be 
hopelessly  slow. 

C.  Major  Algorithmic  Approaches 

There  are  five  major  classes  of  computational  techniques  for 
solving  sparse  approximation  problems: 

1)  Greedy  pursuit.  Iteratively  refine  a  sparse  solution  by 
successively  identifying  one  or  more  components  that 
yield  the  greatest  improvement  in  quality  [6], 

2)  Convex  relaxation.  Replace  the  combinatorial  problem 
with  a  convex  optimization  problem.  Solve  the  convex 
program  with  algorithms  that  exploit  the  problem  struc¬ 
ture  [1], 

3)  Bayesian  methods.  Assume  a  prior  distribution  for  the 
unknown  coefficients  that  favors  sparsity.  Develop  a 
maximum  a  posteriori  estimator  that  incorporates  the 
observation.  Identify  a  region  of  significant  posterior 
mass  [7]  or  average  over  most-probable  models  [8]. 

4)  Nonconvex  optimization.  Relax  the  problem  to  a 
related  nonconvex  problem  and  attempt  to  identify  a 
stationary  point  [9]. 

5)  Brute  force.  Search  through  all  possible  support  sets, 
possibly  using  cutting-plane  methods  to  reduce  the  num¬ 
ber  of  possibilities  [10,  Sec.  3. 7-3. 8], 


This  article  focuses  on  greedy  pursuits  and  convex  opti¬ 
mization.  These  two  methods  have  the  advantages  that  they 
are  computationally  practical  and  lead  to  provably  correct 
solutions  under  well-defined  conditions.  Bayesian  methods  and 
nonconvex  optimization  are  based  on  sound  principles,  but 
they  do  not  currently  offer  theoretical  guarantees.  Brute  force 
is,  of  course,  algorithmically  correct,  but  it  remains  plausible 
only  for  small-scale  problems. 

D.  Verifying  Correctness 

Researchers  have  identified  several  tools  which  can  be  used 
to  prove  that  sparse  approximation  algorithms  produce  optimal 
solutions  to  sparse  approximation  problems.  These  ideas  also 
have  an  impact  on  the  efficiency  of  computational  algorithms, 
so  the  theoretical  background  merits  a  summary. 

The  uniqueness  of  sparse  representations  is  equivalent  to  an 
algebraic  condition  on  submatrices  of  <1>.  Suppose  a  signal  u 
has  two  different  s-sparse  representations  xt  and  x2.  Clearly, 
we  have 

u  =  <&X\  =  &x2  ==>  —  xf)  =  0. 

In  words,  $  maps  a  nontrivial  (2s)-sparse  signal  to  zero.  It 
follows  that  s-sparse  representations  are  unique  if  and  only  if 
each  (2s)-column  submatrix  of  3>  is  injective. 

To  ensure  that  sparse  approximation  is  computationally 
tractable,  we  need  stronger  assumptions  on  <1>.  Not  only  should 
sparse  signals  be  uniquely  determined,  but  they  should  be  sta¬ 
bly  determined.  Consider  a  signal  perturbation  A u  and  an  s- 
sparse  coefficient  perturbation  Ax,  related  by  Am  =  #(Aa;). 
Stability  requires  that  || A£c|| 2  and  ||Am||2  are  comparable. 

This  property  is  commonly  imposed  by  fiat.  We  say  that 
the  matrix  4>  satisfies  the  restricted  isometry  property  (RIP) 
of  order  K  with  constant  5  =  5k  <  1  if 

IMIo  <  K  =*  (l-5)||*||i<|Ml<(l  +  5)rt.  (4) 

This  concept  was  introduced  in  the  important  paper  [11],  For 
sparse  approximation,  we  hope  (4)  holds  for  large  K. 

The  RIP  can  be  verified  using  the  coherence  statistic  of  the 
matrix  3>,  which  is  defined  as 

p  =  max|(v?j,¥>fc)|. 

An  elementary  argument  [12]  via  Gershgorin’s  circle  theorem 
establishes  that  the  RIP  constant  6k  <  pi  K  —  1).  In  signal 
processing  applications,  it  is  common  that  p  «  to-1/2,  so  we 
have  nontrivial  RIP  bounds  for  K  «  i/to.  Unfortunately,  no 
known  deterministic  matrix  yields  a  substantially  better  RIP. 
Early  references  for  coherence  include  [6],  [13]. 

Certain  random  matrices,  however,  satisfy  much  stronger 
RIP  bounds  with  high  probability.  Gaussian  matrices, 
Bernoulli  matrices,  and  random  sections  of  Fourier  matrices 
typically  satisfy  RIP  when  I\  «  m/  \ogp(N)  for  a  small 
integer  p.  This  fact  explains  the  benefit  of  randomness  in 
compressive  sampling.  Establishing  the  RIP  for  a  random 
matrix  requires  techniques  more  sophisticated  than  the  simple 
coherence  arguments.  See  [11]  for  discussion. 

Recently,  researchers  have  observed  that  sparse  matrices 
may  satisfy  a  related  property,  called  RIP-1,  even  when  they 
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do  not  satisfy  (4).  RIP-1  can  also  be  used  to  analyze  sparse 
approximation  algorithms.  See  [14]  for  details. 

E.  Key  Cross-Cutting  Issues 

Structural  properties  of  the  matrix  $  have  a  substantial 
impact  on  the  implementation  of  sparse  approximation  algo¬ 
rithms.  In  most  applications  of  interest,  the  large  size  or  lack 
of  sparseness  in  3>  makes  it  impossible  to  store  this  matrix  (or 
any  substantial  submatrix)  explicitly  in  computer  memory.  It  is 
often  the  case,  however,  that  matrix-vector  products  involving 
<1>  and  can  be  performed  efficiently.  For  example,  the 
cost  of  these  products  is  O  (IV  log  IV)  when  3>  is  constructed 
from  Fourier  or  wavelet  bases.  For  algorithms  that  solve  least- 
squares  problems,  a  fast  multiply  is  particularly  important 
because  it  allows  us  to  use  iterative  methods  such  as  LSQR 
or  conjugate  gradient  (CG).  Nevertheless,  all  the  algorithms 
discussed  below  can  be  implemented  in  a  way  that  requires 
access  to  $  only  through  matrix-vector  products. 

Spectral  properties  of  subdictionaries,  such  as  those  en¬ 
capsulated  in  (4),  have  additional  implications  for  the  com¬ 
putational  cost  of  sparse  approximation  algorithms.  In  par¬ 
ticular,  many  methods  exhibit  fast  asymptotic  convergence 
because  the  RIP  ensures  that  the  subdictionaries  encountered 
during  execution  have  superb  conditioning.  This  point  is  most 
evident  with  algorithms  that  solve  least-squares  problems 
iteratively  because  LSQR  and  CG  are  most  efficient  with  well- 
conditioned  matrices.  Other  approaches  (for  example,  interior- 
point  methods)  are  less  sensitive  to  spectral  properties,  so  they 
become  more  competitive  when  the  RIP  is  less  pronounced  or 
the  target  signal  is  not  particularly  sparse. 

II.  Pursuit  Methods 

A  pursuit  method  for  sparse  approximation  is  a  greedy 
approach  that  iteratively  refines  the  current  estimate  for  the 
coefficient  vector  x  by  modifying  one  or  several  coefficients 
that  yield  a  substantial  improvement  in  approximating  the 
signal.  We  begin  with  a  description  of  the  simplest  effec¬ 
tive  greedy  algorithm,  orthogonal  matching  pursuit  (OMP), 
and  its  theoretical  guarantees.  Afterward,  we  outline  a  more 
sophisticated  method,  called  CoSaMP,  and  its  theory.  We 
conclude  with  some  general  comments  about  the  role  of  greedy 
algorithms  in  sparse  approximation. 

A.  Orthogonal  Matching  Pursuit 

Orthogonal  matching  pursuit  is  one  of  the  earliest  methods 
for  sparse  approximation.  The  basic  references  in  the  signal 
processing  literature  are  [15],  [16],  but  the  idea  can  be  traced 
to  1950s  work  on  variable  selection  in  regression  [10], 

Figure  1  contains  a  mathematical  description  of  OMP.  The 
symbol  <t>o  denotes  the  subdictionary  indexed  by  a  subset 
Oc  {1,2,... ,1V}. 

In  a  typical  implementation  of  OMP,  the  identification  step 
is  the  most  expensive  part  of  the  computation.  The  most 
direct  approach  computes  the  maximum  inner  product  via  the 
matrix-vector  multiplication  <&*rk-i,  which  costs  O (mN)  for 
an  unstructured  dense  matrix.  Some  authors  have  proposed 


Fig.  1.  Orthogonal  Matching  Pursuit  (OMP) 

•  Input.  A  signal  u  G  Rm,  a  matrix  <f>  G  MmxAr 

•  Output.  A  sparse  coefficient  vector  x  G  Rw 

1)  Initialize.  Set  the  index  set  Qq  =  0.  the  residual 
r o  =  u,  and  put  the  counter  k  =  1. 

2)  Identify.  Find  a  column  nk  of  T>  that  is  most  strongly 
correlated  with  the  residual: 

ttk  G  argmax„  |(rfe_1,V3„)|  and 
life  =  — 1  U  {rtfc{ . 

3)  Estimate.  Find  the  best  coefficients  for  approximat¬ 
ing  the  signal  with  the  columns  chosen  so  far. 

xk  =  arg mmy  ||w  -  $nky ||2- 

4)  Iterate.  Update  the  residual: 

rk  =  u  -  $nkxk- 

Increment  k.  Repeat  (2)-(4)  until  stopping  criterion 
holds. 

5)  Output.  Return  the  vector  x  with  components 

x(n )  =  xk(n)  for  n  G  f lk  and  x(n)  =  0  otherwise. 


using  nearest-neighbor  data  structures  to  perform  the  identi¬ 
fication  query  more  efficiently  [17].  In  certain  applications, 
such  as  ridge  regression,  the  “columns”  of  d>  are  indexed  by 
a  continuous  parameter,  and  identification  can  be  posed  as  a 
low-dimensional  optimization  problem. 

The  estimation  step  requires  the  solution  of  a  least-squares 
problem.  The  most  common  technique  is  to  maintain  a  QR 
factorization  of  ‘Fq,.,  which  has  a  marginal  cost  of  0(mk)  in 
the  kth  iteration.  The  new  residual  rk  is  a  by-product  of  the 
least-squares  problem,  so  it  requires  no  extra  computation. 

There  are  several  natural  stopping  criteria. 

•  Halt  after  a  fixed  number  of  iterations:  k  =  s. 

•  Halt  when  the  residual  has  small  magnitude:  |r/,.  ||2  <  e. 

•  Halt  when  no  column  explains  a  significant  amount  of 
energy  in  the  residual:  |j<I>*r/s_1||00  <  £. 

These  criteria  can  all  be  implemented  at  minimal  cost. 

As  with  other  algorithms,  OMP  can  take  advantage  of 
a  dictionary  that  offers  a  fast  matrix-vector  product;  see 
Section  I-E. 

A  huge  number  of  related  greedy  pursuit  algorithms  have 
been  proposed  in  the  literature;  we  cannot  do  them  justice 
here.  Some  noteworthy  variants  include  matching  pursuit  [6], 
the  relaxed  greedy  algorithm  [18],  and  the  /)  -penalized  greedy 
algorithm  [19]. 

B.  Guarantees 

OMP  produces  the  residual  rm  =  0  after  m  steps  (provided 
that  the  dictionary  can  represent  the  signal  u  exactly),  but  this 
representation  hardly  qualifies  as  sparse.  Classical  analyses  of 
greedy  pursuit  focus  instead  on  the  rate  of  convergence. 
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Greedy  pursuits  often  converge  linearly  with  a  rate  that 
depends  on  how  well  the  dictionary  covers  the  sphere  [6]. 
For  example,  OMP  offers  the  estimate 

IMI2  <  (1  -  g2)k/2  IMI2,  where 
Q  =  inf||„||2=1sup„  \{v,<pn)\- 

See  [16,  Sec.  3]  for  details.  Unfortunately,  the  covering  param¬ 
eter  q  is  O (ra-1/2)  unless  N  =  0(em).  Since  p  is  typically 
quite  small,  this  type  of  result  has  limited  interest. 

A  second  type  of  result  demonstrates  that  the  rate  of 
convergence  depends  on  how  well  the  dictionary  expresses  the 
signal  of  interest  [18,  Eqn.  (1.9)].  For  example,  OMP  offers 
the  estimate 

H^fc II2  <  k~1/2  ||iz||#,  where 
IMI*  =  inf { 1 1 cc 1 1 1  :  u  =  3>ai}. 

The  dictionary  seminorm  ||  •  ||#  is  typically  small  when 
its  argument  has  a  good  sparse  approximation.  For  further 
improvements  on  this  estimate,  see  [20].  This  bound  is  usually 
superior  to  the  exponential  rate  estimate,  but  it  can  be  disap¬ 
pointing  for  signals  with  excellent  sparse  approximations. 

Subsequent  work  established  that  greedy  pursuit  produces 
near-optimal  sparse  approximations  with  respect  to  incoherent 
dictionaries  [17],  [21].  For  example,  if  3 pk  <  1,  then 

IKII2  <  Vl  +  Gk\\u-  a*k\\2, 

where  a\  denotes  the  best  1 2  approximation  of  it  as  a  linear 
combination  of  k  columns  from  3>.  See  [22],  [23],  [24]  for 
refinements. 

Finally,  we  remark  that,  if  is  sufficiently  random,  then 
OMP  provably  recovers  sparse  signals.  See  [25]. 

C.  CoSciMP 

Research  on  greedy  pursuits  has  recently  culminated  in  a 
new  algorithm,  called  CoSaMP,  that  offers  optimal  perfor¬ 
mance  guarantees.  The  method  was  designed  for  compressive 
sampling,  but  it  also  offers  attractive  guarantees  for  classi¬ 
cal  sparse  approximation  problems.  CoSaMP  economizes  on 
matrix-vector  multiplications,  so  it  is  most  valuable  when 
these  products  can  be  computed  efficiently  [26].  A  related 
algorithm  appears  in  [27]. 

Figure  2  contains  a  description  of  CoSaMP.  The  symbol 
[x]r  denotes  the  restriction  of  x  to  the  r  components  largest 
in  magnitude,  ties  broken  lexicographically. 

To  ensure  that  CoSaMP  behaves  well,  we  assume  through¬ 
out  this  section  that  the  dictionary  3>  verifies  the  RIP  (4)  of 
order  2s  with  constant  52 s  1.  As  noted,  this  hypothesis 

holds  for  reasonable  values  of  s  when  3>  is  incoherent  or 
suitably  random.  Of  course,  the  algorithm  can  be  applied 
without  the  RIP,  but  its  performance  may  be  unpredictable. 

In  the  implementation,  it  is  important  to  use  an  iterative 
algorithm  to  solve  the  least-squares  problem  in  the  estimation 
step.  The  RIP  ensures  that  these  subproblems  are  always 
well  conditioned,  so  iterative  methods  converge  quickly.  As 
a  consequence,  each  outer  iteration  of  the  algorithm  requires 
at  most  a  constant  number  of  matrix-vector  multiplications. 


Fig.  2.  Compressive  Sampling  Matching  Pursuit  (CoSaMP)  variant 

•  Input.  A  signal  it  g  Rm,  a  matrix  <f>  g  RmxN, 
sparsity  parameter  s 

•  Output.  An  s-sparse  coefficient  vector  x  g  R A' 

1)  Initialize.  Set  the  initial  coefficient  vector  Xq  =  0 
and  the  residual  r(j  =  it.  Let  k  =  1. 

2)  Identify.  Find  2s  columns  of  d>  that  are  most  strongly 
correlated  with  the  residual: 

ft  g  argmin|Ti<2s  V]  \(rk-i,<pn)\- 

3)  Estimate.  Find  the  best  coefficients  for  approximat¬ 
ing  the  residual  with  the  chosen  columns: 

yk  =  argmin^  ||rfc_i  -  &ny\\2 

4)  Prune.  Combine  the  old  and  new  coefficient  vectors 
and  retain  the  s  largest  components: 

z  =  xk-i  +  Vk  and  xk  =  [z]a. 

5)  Iterate.  Update  the  residual: 

rk  =  u-  $xk. 

Repeat  (2)— (5)  until  stopping  criterion  holds. 

6)  Output.  Return  x  =  xk. 


and  this  cost  dominates  the  computation.  See  Section  I-E  for 
further  discussion  of  this  point. 

Needell  and  Tropp  demonstrate  that  each  iteration  of  the 
CoSaMP  algorithm  reduces  the  approximation  error  by  a 
constant  factor  until  it  approaches  its  minimal  value.  To  be 
specific,  suppose  the  signal  u  =  <frx*  +  e  for  arbitrary 
coefficient  vector  x*  and  noise  term  e.  If  we  run  the  algorithm 
for  a  sufficient  number  of  iterations,  the  output  x  satisfies 

\\x *  -  x\\2  <  Cs~1,2\\x*  -  [£B*]s/2||1  +  C||e||2,  (5) 

where  C  is  a  constant.  No  algorithm  can  produce  an  essentially 
smaller  error  for  general  input  signals. 

This  error  bound  allows  us  to  develop  stopping  criteria 
for  CoSaMP  that  are  tailored  to  the  signals  of  interest.  For 
example,  when  the  sorted  entries  of  the  coefficient  vector 
x*  decay  polynomially,  it  can  be  verified  that  the  algorithm 
requires  only  O(logfV)  iterations. 

It  is  worth  noting  a  strong  similarity  between  CoSaMP  and 
iterative  thresholding  algorithms.  See  [28]  for  some  related 
results. 

D.  Commentary 

Greedy  pursuit  methods  have  often  been  considered  naive, 
in  part  because  there  are  contrived  examples  where  the 
greedy  approach  fails  spectacularly.  (See  [1,  Sec.  2.3.2]  for 
an  exposition  of  this  claim.)  Recent  research  has  vindicated 
greedy  pursuits  by  demonstrating  that  they  succeed  in  many 
of  the  situations  where  convex  relaxation  works.  Still,  it  is 
misleading  to  think  of  greedy  methods  and  convex  relaxation 
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methods  as  distinct  approaches  to  sparse  approximation.  In¬ 
deed,  the  greedy  selection  technique  is  closely  related  to  dual 
coordinate  ascent  algorithms  (see  Section  III-F).  Similarly, 
certain  methods  for  convex  relaxation,  such  as  LARS  [29] 
and  homotopy  [30],  use  a  type  of  greedy  selection  at  each 
iteration,  which  is  based  on  the  desire  to  solve  an  underlying 
parametrized  optimization  problem. 

Greedy  pursuits  and  related  methods  (such  as  homotopy) 
are  sometimes  quite  fast,  especially  in  the  ultrasparse  regime 
in  which  the  number  of  nonzeros  in  the  representation  is 
very  small.  Greedy  techniques  can  sometimes  be  applied  to 
situations  in  which  convex  relaxation  cannot  be  used.  For  ex¬ 
ample,  when  the  dictionary  contains  a  continuum  of  elements, 
the  greedy  approach  can  reduce  sparse  approximation  to  a 
sequence  of  simple  one-dimensional  optimization  problems. 

Another  advantage  of  greedy  methods  is  that  they  can 
incorporate  constraints  that  do  not  fit  naturally  into  convex 
programming  formulations.  For  example,  the  data  stream 
community  has  proposed  efficient  algorithms  for  computing 
near-optimal  histograms  and  wavelet-packet  approximations 
from  compressive  samples  [4],  More  recently,  it  has  been 
shown  that  CoSaMP  can  be  modified  to  enforce  tree-like 
constraints  on  wavelet  coefficients;  extensions  to  simultaneous 
sparse  approximation  problems  have  also  been  developed  [31]. 
This  is  an  exciting  and  important  line  of  work. 

III.  Optimization 

Another  fundamental  approach  to  sparse  approximation 
replaces  the  combinatorial  £0  function  in  the  mathematical 
programs  from  Subsection  I-A  with  the  l\  norm,  which  yields 
convex  optimization  problems  that  admit  tractable  algorithms. 
In  a  concrete  sense  [32],  the  t\  norm  is  the  closest  convex 
function  to  the  t$  function,  so  this  relaxation  is  very  natural. 

The  convex  relaxation  of  the  equality-constrained  problem 
becomes 

min  ||tc|| !  subject  to  4>ai  =  u ,  (6) 

X 

and  the  mixed  formulation  becomes 

min  -  u\\l  +  t|M|i-  (7) 

x  Z 

Here,  r  >  0  is  a  regularization  parameter  whose  value  governs 
the  sparsity  of  the  solution:  large  values  typically  produce 
sparser  results.  It  may  be  difficult  to  select  an  appropriate 
value  for  r  in  advance,  since  it  controls  the  sparsity  indirectly. 
As  a  consequence,  we  often  need  to  solve  (7)  repeatedly 
for  different  choices  of  this  parameter,  or  even  to  trace  the 
complete  path  of  solutions  as  r  decreases  toward  zero.  When 
r  >  ||#*m||oo>  the  solution  of  (7)  is  x  =  0. 

Another  variant  is  the  LASSO  formulation  [33],  which  first 
arose  in  the  context  of  variable  selection: 

min  ||<&a;  —  u\\\  subject  to  \\x  Hi  <0.  (8) 

X 

The  LASSO  is  equivalent  to  (7)  in  the  sense  the  path  of 
solutions  to  (8)  parameterized  by  positive  (3  matches  the 
solution  path  for  (7)  as  r  varies.  Finally,  we  note  another 
common  formulation 

min||ai||i  subject  to  ||4>at  —  u  ||2  <  e  (9) 

X 


that  explicitly  parameterizes  the  error  constraint. 

A.  Guarantees 

It  has  been  demonstrated  that  convex  relaxation  methods 
produce  optimal  or  near-optimal  solutions  to  sparse  approxi¬ 
mation  problems  in  a  variety  of  settings. 

The  earliest  results  [13],  [12],  [21]  establish  that  the 
equality-constrained  problem  (6)  correctly  recovers  all  s- 
sparse  signals  from  an  incoherent  dictionary  provided  that 
2/i s  <  1.  In  the  best  case,  this  bound  applies  at  the  sparsity 
level  s  «  \fm.  Subsequent  work  [34],  [35],  [23]  showed  that 
the  convex  programs  (7)  and  (9)  can  identify  noisy  sparse 
signals  in  a  similar  parameter  regime. 

The  results  described  above  are  sharp  for  deterministic 
signals,  but  they  can  be  extended  significantly  for  random 
signals  that  are  sparse  with  respect  to  an  incoherent  dictionary. 
The  paper  [36]  proves  that  that  the  equality-constrained  prob¬ 
lem  (6)  can  identify  random  signals,  even  when  the  sparsity 
level  s  has  order  m/ log  to.  Most  recently,  the  paper  [37] 
observed  that  ideas  from  [35],  [38]  imply  that  the  convex 
relaxation  (7)  can  identify  noisy,  random  sparse  signals  in  a 
similar  parameter  regime. 

Results  from  [11],  [39]  demonstrate  that  convex  relaxation 
succeeds  well  in  the  presence  of  the  RIR  For  example,  suppose 
that  the  signal  u  =  <l>x'  +  e  and  the  dictionary  has  RIP 
constant  <j2s  <C  1.  Then  the  solution  x  to  (9)  verifies 

\\x  -  £C*||2  <  Cs-1/2||ai*  -  [sc*] s||i  +  Ce, 

provided  that  £  >  l|e||2  .  Compare  this  bound  with  the  result  (5) 
for  CoSaMP. 

Finally,  let  us  mention  an  alternative  approach  for  analyzing 
convex  relaxation  algorithms  that  relies  on  geometric  proper¬ 
ties  of  the  kernel  of  the  dictionary  [40],  [41],  [42], 

B.  Active  Set  /  Pivoting 

Pivoting  algorithms  explicitly  trace  the  path  of  solutions 
as  the  scalar  parameter  in  (8)  ranges  over  a  set  of  values. 
This  approach  relies  on  the  fact  that  the  solution  to  (8)  is 
a  piecewise-linear  function  of  (3 ,  a  consequence  of  the  fact 
that  the  optimality  (KKT)  conditions  can  be  stated  as  a  linear 
complementarity  problem.  By  referring  to  the  KKT  system, 
we  can  quickly  identify  the  next  “breakpoint”  on  the  solution 
path — the  closest  value  of  [3  where  the  derivative  of  the 
piecewise-linear  function  changes. 

The  homotopy  method  of  Osborne,  Presnell,  and 
Turlach  [30]  follows  this  approach.  The  algorithm  starts 
with  (3  =  0,  where  the  solution  of  (8)  is  identically  zero, 
and  it  progressively  locates  the  next  largest  value  of  / 3  where 
a  column  enters  or  exits  the  active  set.  The  efficiency  of 
the  algorithm  depends  on  updating  and  downdating  a  QR 
factorization  of  the  active  columns.  A  similar  method  [29]  is 
implemented  as  SolveLasso  in  the  SparseLab  toolbox1. 
Similar  approaches  can  be  developed  for  (7). 

If  we  limit  our  attention  to  values  of  (3  where  the  number 
of  nonzero  entries  in  x  remains  small,  the  active  set  approach 

1  http://www.sparselab.stanford.edu 
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is  reasonably  efficient.  If  we  apply  the  homotopy  method  until 
s  nonzero  components  are  identified,  the  workload  consists  of 
approximately  2s  matrix- vector  multiplications  by  3>  or  <!>*, 
together  with  0(ms2)  operations  for  updating  the  factorization 
and  performing  other  linear  algebra  operations.  This  cost  is 
comparable  with  OMP. 

Indeed,  OMP  and  homotopy  are  quite  similar.  In  each  case, 
the  set  of  nonzero  components  is  refined  by  progressively 
adding  components  and  updating  the  solution  of  a  reduced 
linear  least-squares  problem.  The  criterion  for  selecting  com¬ 
ponents  involves  the  inner  products  between  inactive  columns 
of  $  and  the  current  residual  u—&x.  One  notable  difference  is 
that  homotopy  occasionally  rejects  columns  that  have  already 
been  chosen.  See  [29],  [43]  for  other  comparisons. 


Fig.  3.  Sparse  Reconstruction  via  Separable  Approximation  (SpaRSA) 

.  Input.  A  signal  u  G  Rm,  a  matrix  (1>  G  RmxiV, 
regularization  parameter  r  >  0,  initial  estimate  x0 
of  the  representation  vector. 

•  Output.  Coefficient  vector  x  €  R  N 

1)  Initialize.  Set  k  =  1. 

2)  Iterate.  Choose  ak  and  obtain  x/r  from  (10a).  If  an 
acceptance  test  on  x^  is  not  passed,  increase  ak  by 
some  factor  and  repeat. 

3)  Line  Search.  Choose  7^  G  (0, 1]  and  obtain  xk+\ 
from  (10b). 

4)  Test.  If  stopping  criterion  holds,  terminate  with  x  = 
Xk+i-  Otherwise,  set  k  <—  fc  +  1  and  go  to  (2). 


C.  Interior-Point  Methods 


Interior-point  methods  were  among  the  first  approaches 
developed  for  solving  sparse  approximation  problems  by 
convex  optimization.  The  early  algorithms  [1],  [44]  apply 
a  primal-dual  interior-point  framework  where  the  innermost 
subproblems  are  formulated  as  linear  least-squares  problems 
that  can  be  solved  with  iterative  methods.  A  crucial  aspect 
of  this  approach  is  that  the  algorithms  take  advantage  of  fast 
matrix-vector  multiplications.  An  implementation  is  available 
as  pdco  and  SolveBP  in  the  SparseLab  toolbox. 

Other  interior-point  methods  have  also  been  proposed  ex¬ 
pressly  for  compressive  sampling  problems.  The  paper  [45] 
describes  a  primal  log-barrier  approach  for  a  quadratic  refor¬ 
mulation  of  (7): 


min  — 1|  (hat 
X,Z  2 


U.H2  +  rl7 2:  subject  to  z  <  x  <  z. 


The  technique  relies  on  a  specialized  preconditioner  that  al¬ 
lows  the  internal  Newton  iterations  to  be  completed  efficiently 
with  CG.  The  method2  is  available  as  ll_ls.  The  /:) -magic 
package3  [46]  contains  a  primal  log-barrier  code  for  the 
second-order  cone  formulation  (9),  which  includes  the  option 
of  solving  the  innermost  linear  system  with  CG. 

In  general,  interior-point  methods  are  not  competitive  with 
the  gradient  methods  of  Subsection  III-D  on  problems  with 
very  sparse  solutions,  and  they  do  not  benefit  from  warm 
starts  to  the  same  extent  as  the  other  approaches  in  this 
section.  On  the  other  hand,  their  performance  is  relatively 
insensitive  to  the  sparsity  of  the  solution  or  the  value  of  the 
regularization  parameter.  Interior-point  methods  appear  to  be 
more  robust  in  the  sense  that  there  are  few  cases  of  very  slow 
performance  or  outright  failure,  which  sometimes  occurs  with 
other  approaches. 


D.  Gradient  Methods 

Gradient  descents,  also  known  as  first-order  methods,  are 
iterative  algorithms  for  solving  (7)  in  which  the  major  opera¬ 
tion  at  each  iteration  is  to  compute  gradient  of  the  least-squares 

2www.stanford.edu/'~boyd/ll_ls/ 

3www.l  1  -magic.org 


term  at  the  current  iterate,  viz.,  <t>  (T>.'K/.  —  u).  Many  of  these 
methods  compute  the  next  iterative  xk+\  using  the  rules 

xk  :=  arg  min  (z  —  xk)*^*(^xk  —  u) 

+  ^ak\\z-xk\\l  +  T\\z\\1,  (10a) 

xk+i  :=  xk  +  7fe(£Cfe  -  xk)-  (10b) 


for  some  choice  of  scalar  parameters  ak  and  ~jk.  Alternatively, 
we  can  write  the  subproblem  (10a)  as 


x^  :=  arg  min 


2:  -  - $*($atfc  -  u) 

\  ak 

Oik 


(11) 


Figure  3  specifies  a  representative  algorithm,  called 
SpaRSA,  that  falls  into  the  operator-splitting  framework  of 
[47],  Other  aliases  include  iterative  splitting  and  thresholding 
(1ST)  [48]  and  fixed-point  iteration  [49], 

“Standard”  convergence  results  for  these  methods,  e.g., 
[47,  Theorem  3.4],  require  that  sup kak  >  11^*^112/2,  a 
tight  restriction  that  usually  leads  to  slow  convergence  in 
practice.  The  more  practical  variants  described  in  [50]  admit 
smaller  values  of  ak,  provided  that  a  sufficient  decrease  in  the 
objective  in  (7)  occurs  over  a  span  of  successive  iterations. 
One  particular  approach  uses  a  Barzilai-Borwein  formula  that 
chooses  a  value  of  ak  in  the  spectrum  of  <t>  <t> .  When  a;/  fails 
the  acceptance  test  in  Step  2,  the  parameter  ak  is  increased 
(repeatedly,  as  necessary)  by  a  constant  factor.  Steplengths 
7^  =  1  are  used  in  [49]  and  [50]. 

Several  variants  of  the  SpaRSA  have  appeared.  The  “iterated 
hard  shrinkage”  method  of  [51]  sets  ak  =  0  in  (10)  and 
chooses  7^  to  do  a  conditional  minimization  along  the  search 
direction.  TwIST  [52],  a  variant  of  1ST  that  is  significantly 
faster  in  practice,  deviates  from  the  SpaRSA  framework  in  that 
the  previous  iterate  xk_i  also  enters  into  the  step  calculation, 
in  the  manner  of  successive  over-relaxation  approaches  in 
other  areas  of  scientific  computation.  GPSR  [53]  is  another 
approach  for  solving  (7)  that  can  be  viewed  as  a  gradient- 
projection  algorithm  for  the  convex  quadratic  program  ob¬ 
tained  by  splitting  x  into  positive  and  negative  parts. 
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The  SpaRSA  approach  works  well  on  sparse  signals  when 
<1>  satisfies  the  RIR  For  these  problems,  it  tends  to  identify 
the  nonzero  components  of  x  quickly,  after  which  the  method 
behaves  essentially  like  an  iterative  least-squares  method. 
Because  of  the  RIR  these  final  iterates  converge  quickly. 
Computationally,  these  steps  are  quite  similar  to  the  estimation 
step  of  CoSaMP. 

When  the  solution  of  (7)  is  not  particularly  sparse  or 
the  regularization  parameter  r  is  small,  gradient  approaches 
may  be  ineffective,  converging  slowly  or  not  at  all.  In  this 
setting,  warm  starts  can  vastly  improve  the  behavior  of  these 
approaches.  That  is,  we  can  dramatically  reduce  the  number 
of  iterations  when  the  initial  estimate  Xi,  in  Step  1  is  close 
to  the  true  solution.  This  observation  motivates  the  design  of 
continuation  strategies,  in  which  we  solve  (7)  for  a  decreasing 
sequence  of  values  of  r,  using  the  approximate  solution  for 
each  value  as  the  starting  point  for  the  next  subproblem.  It 
is  evident  that  continuation  is  related  to  pivoting  strategies  of 
Subsection  III-B  that  track  individual  changes  in  the  active 
components  of  x  explicitly. 

Some  explicit  continuation  methods  are  described  in  [49], 
[50],  Though  adaptive  strategies  for  choosing  the  sequence  of 
r  values  have  been  proposed,  the  design  of  a  robust,  practical, 
and  theoretically  effective  continuation  algorithm  remains  an 
important  open  question. 

E.  Extensions  of  Gradient  Methods 

Second-order  information  can  be  used  to  enhance  gradient 
projection  approaches  by  taking  approximate  reduced  Newton 
steps  in  the  subset  of  components  of  x  that  appears  to  be 
nonzero.  In  some  approaches  [53],  [50],  this  enhancement  is 
made  only  after  the  first-order  algorithm  is  terminated  as  a 
means  of  removing  the  bias  in  the  formulation  (7)  introduced 
by  the  regularization  term.  Other  methods  [54]  use  second- 
order  information  at  intermediate  steps  of  the  algorithm,  much 
like  two-metric  gradient  projection  [55],  (A  similar  approach 
was  proposed  for  the  related  problem  of  i\  -regularized  logistic 
regression  in  [56].)  To  compute  the  second-order  components 
of  the  steps,  iterative  methods  can  be  used  to  find  approximate 
solutions  to  unconstrained  least-squares  subproblems  involving 
only  the  nonzero  components  of  x.  These  subproblems  are,  of 
course,  closely  related  to  the  ones  that  arise  in  the  matching 
pursuit  algorithms  of  Section  II. 

A  different  type  of  gradient  projection  approach  is  described 
by  [57,  Section  4],  which  considers  the  formulation  (8).  This 
approach  takes  steps  along  the  negative  gradient  of  the  least- 
squares  objective  in  (8),  with  steplength  chosen  by  a  Barzilai- 
Borwein  formula  (with  backtracking  to  ensure  monotonicity) 
and  projects  the  resulting  vector  onto  the  constraint  set  ||£c||  i  < 
(3.  The  ultimate  goal  in  [57]  is  to  solve  (9)  for  a  given  value 
of  e.  Then  a  scalar  equation  is  solved  to  identify  the  value  of 
(3  for  which  the  solution  of  (8)  coincides  with  the  solution  of 
(9)  for  the  given  e. 

Finally,  we  mention  that  optimal  gradient  methods  for 
convex  minimization  [58],  [59],  [60]  can  be  applied  to  solve 
the  formulation  (7).  These  methods  have  many  variants,  but 
they  share  the  goal  of  finding  an  approximate  solution  that  is 


as  close  as  possible  to  the  optimal  set  (as  measured  by  norm- 
distance  or  by  objective  value)  in  a  given  budget  of  iterations. 
(In  contrast,  standard  methods  aim  to  make  significant  progress 
during  each  individual  iteration.)  Optimal  gradient  methods 
typically  generate  several  concurrent  sequences  of  iterates,  and 
they  have  complex  steplength  rules  that  depend  on  some  prior 
knowledge,  such  as  the  Lipschitz  constant  of  the  gradient. 
Specific  works  that  solve  (7)  using  optimal  gradient  methods 
include  the  papers  [61],  [62], 

The  favorable  asymptotic  properties  of  optimal  first-order 
methods  do  not  manifest  themselves  for  problems  with  rela¬ 
tively  large  values  of  r  or  with  very  sparse  solutions.  These 
algorithms  can  be  competitive  with  other  gradient  methods 
when  t  is  small,  although  the  poor  performance  of  methods 
such  as  SpaRSA  on  such  problems  can  be  improved  signifi¬ 
cantly  by  the  use  of  continuation. 

Recently,  Nemirovski  and  coworkers  have  proposed  a  tech¬ 
nique  for  solving  the  formulation  (9)  by  means  of  robust  opti¬ 
mal  gradient  algorithms  for  stochastic  optimization  [63].  This 
approach  interprets  matrix-vector  products  as  expectations  of 
a  random  variable.  The  method  appears  to  be  very  effective 
when  the  dictionary  lacks  a  fast  multiply. 

F.  Dual-Based  Algorithms 

Greedy  pursuit  methods  are  strongly  connected  with  a  dual 
formulation  of  the  problem  (7): 

min  —  llerlln  —  u1  cr  subject  to  —  1  <  <f>*er  <  1.  (12) 

CT  2 

An  active-set  method  for  this  formulation  [64]  solves  a  se¬ 
quence  of  subproblems  where  a  subset  of  the  constraints 
(corresponding  to  a  subdictionary)  is  enforced.  By  converting 
back  to  the  primal,  these  subproblems  can  each  be  expressed  as 
a  least-squares  problem  over  this  subdictionary.  Typically,  the 
subdictionaries  differ  by  a  single  column  from  one  problem  to 
the  next.  These  approaches  have  not  been  explored  extensively 
in  the  optimization  literature. 

IV.  Horizons 

Test  problem  collections  representative  of  sparse  approxi¬ 
mation  problems  encountered  in  practice  are  crucial  to  guiding 
further  development  of  sparse  reconstruction  algorithms.  Most 
algorithmic  papers  report  results  only  on  synthetic  test  cases. 
The  most  significant  effort  in  this  direction  to  date  is  Sparco 
[65],  a  Matlab  environment  for  interfacing  algorithms  and 
constructing  test  problems  that  also  includes  a  variety  of 
problems  gathered  from  the  literature. 

Many  of  the  algorithms  we  describe  in  this  paper  lend 
themselves  well  to  implementation  on  commodity  graphi¬ 
cal  processing  units  (GPUs),  for  certain  matrices  <1>.  This 
inexpensive  hardware  allows  remarkable  increases  in  speed 
over  conventional  CPU  implementations,  for  certain  kinds 
of  compute-intensive,  multithreaded  numerical  computations. 
For  sensing  matrices  consisting  of  randomly  selected  rows  of 
the  discrete  cosine  transformation,  an  implementation  of  the 
SpaRSA  strategy  is  described  in  [66],  Availability  of  a  GPU 
implementation  of  the  FFT  is  crucial  to  the  success  of  this 
approach.  Multicore  and  GPU  implementations  of  first-order 
methods  are  described  also  in  [67]. 
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